4 research outputs found

Creating an Arabic Dialect Text Corpus by Exploring Twitter, Facebook, and Online Newspapers

Authors: Alshutayri A, Atwell E
Publication venue: LREC
Publication date: 01/05/2018

Arabic dialects annotation using an online game

Authors: Alshutayri A, Atwell E
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 07/06/2018

Modern Standard Arabic is the written standard across the Arab world, but Arabic dialects are increasingly used in social media, which makes social media an appropriate source for a corpus supporting research on classifying Arabic dialect texts with machine learning algorithms. An important first step is annotating the text corpus with correct dialect tags. We collected tweets from Twitter and comments from Facebook and online newspapers, aiming for representative samples of five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. We then explored crowdsourcing the corpus annotation. The annotation task was developed as an online game in which players test their dialect classification skills and receive a score reflecting their knowledge. This approach has so far yielded 24K annotated documents containing 587K tokens; 16,179 documents were tagged as dialect and 7,821 as Modern Standard Arabic.
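The abstract above does not say how labels from different players are combined. As a minimal sketch only, assuming a simple majority vote with an agreement threshold, aggregation could look like the following. The tag names, the `aggregate_label` function, and the threshold are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical tag set: five dialect groups plus Modern Standard Arabic.
TAGS = {"GLF", "IRQ", "EGY", "LEV", "NOR", "MSA"}


def aggregate_label(votes, min_agreement=0.5):
    """Return the majority tag among player votes.

    Returns None when there are no valid votes or when the most common
    tag does not exceed `min_agreement` as a share of the valid votes.
    """
    valid = [v for v in votes if v in TAGS]
    if not valid:
        return None
    tag, count = Counter(valid).most_common(1)[0]
    return tag if count / len(valid) > min_agreement else None
```

With this rule, a document voted `["EGY", "EGY", "LEV"]` is tagged `EGY`, while an even split such as `["EGY", "LEV"]` stays unlabelled until more players have seen it.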

Crossref

White Rose Research Online

Arabic Language WEKA-Based Dialect Classifier for Arabic Automatic Speech Recognition Transcripts

Authors: Alosaimy A, Alshutayri A, Atwell ES, Dickins J, Ingleby M, Watson J
Publication date: 13/11/2016

This paper describes an Arabic dialect identification system that we developed for the Discriminating Similar Languages (DSL) 2016 shared task. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which offers many alternative filters and classifiers for machine learning. We experimented with several classifiers; the best accuracy was achieved with the Sequential Minimal Optimization (SMO) algorithm for training and testing, with three different feature sets, one per test run. Our approach achieved an accuracy of 42.85%, considerably worse than the 80-90% scores obtained when evaluating on the training set and the roughly 50% accuracy obtained with a 60:40 percentage split of the training set. We observed that the Buckwalter transcripts from the Saarland Automatic Speech Recognition (ASR) system are given without short vowels, although the Buckwalter system has notation for them. We elaborate on these observations, describe our methods, and analyse the training dataset.
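The 60:40 percentage split and the accuracy figure mentioned above can be illustrated with a short plain-Python sketch. This is not WEKA code and does not reproduce the authors' setup. The function names, the shuffling seed, and the rounding rule are assumptions made for illustration.

```python
import random


def percentage_split(items, train_fraction=0.6, seed=0):
    """Shuffle items and split them into (train, test) portions.

    With train_fraction=0.6 this gives a 60:40 split. The seed is
    fixed only so that the sketch is reproducible.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```

Evaluating on the held-out 40% rather than on the training data itself is what usually produces the lower, more realistic score: the abstract reports 80-90% on the training set against about 50% with the split.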

White Rose Research Online

A Social Media Corpus of Arabic Dialect Text

Authors: Alshutayri A, Atwell E
Publication venue: Presses universitaires Blaise Pascal
Publication date: 13/06/2019

White Rose Research Online